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Abstract — We introduce GP-FNARX: a new model for non- 
linear system identification based on a nonlinear autoregressive 
exogenous model (NARX) with filtered regressors (F) where 
the nonlinear regression problem is tackled using Gaussian 
processes (GP). We integrate data pre-processing with system 
identification into a fully automated procedure that goes from 
raw data to an identified model without human intervention. 
Moreover, we obtain a Bayesian model of the system's dynamics 
which is able to report its uncertainty in regions where the data 
is scarce. The automated approach, the modeling of uncertainty 
and its relatively low computational cost make GP-FNARX a 
good candidate for applications in robotics and adaptive control 
tasks. 

I. Introduction 

System identification consists in building mathematical 
models of dynamical systems based on observed input and 
output signals. Those signals are often contaminated by noise 
and do not necessarily explore the whole operating range of 
the system. Yet, system identification methods are required 
to find a model that faithfully explains the observed data and 
has graceful generalization capabilities. 

In this paper we will be particularly concerned with 
identifying models of nonlinear systems when only a limited 
amount of data are available. Data could be scarce due to 
the high cost of experimentation, vast operating range of 
the system, presence of closed loop control or many other 
reasons. As a consequence, we will require a model that can 
accurately describe the system wherever data is present but 
which can also report its own ignorance in regions where no 
data can support its claims. This naturally leads to the use 
of a Bayesian framework for the identification of the system 
dynamics [1]. 

A very common and general representation of a dynamical 
system is in terms of a state space model. The future 
behavior of the system is fully described by its current state 
x and external input u. In statistical terms, future states are 
conditionally independent of past states and inputs given the 
current ones 

p(x t+ i\x t ,x t -i, ...,u t ,ut-i, — ) =p(xt+i\x t ,u t ). (1) 

A system with this characteristic is said to have the Markov 
property. 

The state of a system is often not perfectly observed. Only 
a potentially noisy and incomplete observation is available. 
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An observation model is usually defined as a probability den- 
sity of an observation y t conditioned on the state at the same 
time step: p{yt\xt)- In the particular case where a nonlinear 
state space model has additive independent Gaussian noise 
in both the state transition and the observation, the model 
can be presented as 

x t+1 = f(x t ,u t ) + 5 U 6 t ~M(0,Q) (2) 

y t = g{x t ) + e u e t ~Af(0,R). (3) 

For time-invariant systems, / and g represent fixed deter- 
ministic functions although both the state transition and the 
observation model are stochastic difference equations (2|3 1, 
In statistical terms, the system identification problem can 
be specified as finding the state transition probability condi- 
tioned on the observed inputs and outputs 



P(Xt + l\Xt,U U U 1: N,yi:N), t > N. 



(4) 



The fact that the states are not observed makes this dis- 
tribution exceedingly difficult to compute. In particular, if 
one wants to obtain a model that is able to report its own 
uncertainty (i.e. a Bayesian model), there exist no closed 
form solutions even for the simplest linear-Gaussian state 
space models [2], [3]. 

A popular alternative to state space models can be found in 
autoregressive exogenous (ARX) models [4]. These models 
represent the system based only on observable quantities: the 
inputs and the outputs. A nonlinear ARX (NARX) model 
with Gaussian innovations provides an estimate of the next 
output based on a finite amount of previous inputs and 
outputs 

Vt = f{y t -i,yt-2,-,ut-i,-) + $t, 6 t ~Af(0,Q). (5) 

In contrast with state space models, there are no latent 
states and the model does not have the Markov property. In 
practical terms, this representation based only on observables 
makes it possible to sidestep many of the problems associated 
with system identification when latent states are present. 

An important drawback of autoregressive models is that 
they do not account for observation noise. If we consider 
equation [5] in a generative manner by assuming a known 
/ and sequentially generate new values of y t , yt+i, it is 
apparent that all randomness injected through the innovation 
S has an influence on the future trajectory of the system. 
In other words, the innovation can be considered a form of 
process noise. The model does not accommodate for any kind 
of observation noise such as the one described in equation [3] 
for the case of state space models. 

A common way to deal with observation noise is to 
pre-process the measured input and output data to remove 



as much noise as possible before identifying the NARX 
model [4]. For instance, low -pass filtering the signals can 
remove a significant component of their noise. For optimal 
performance, the filter needs to be tuned to the particular 
characteristics of the noise. 

We propose an automated approach which interleaves filter 
tuning with learning of a Bayesian model of the system 
dynamics. In this way, noise modeling is performed in a 
completely data-driven fashion. Our method is conceptually 
close to state space models in the sense that we consider that 
observations are corrupted by noise and that there are latent 
variables inaccessible to us. However, our method is simpler 
since the latent space is the space of observables. 

In section |ll] we present an overview of the field of 
Bayesian system identification together with a brief intro- 
duction to Gaussian processes. 



Ill 



we 



Then, in section 
describe our approach for system identification integrating 
data preprocessing with learning the system dynamics. In 
section IV we compare our approach to related work. Then 



we show experimental results in section [V] and, finally, in 
section 



VI we present the conclusions. 



II. Background Theory 

A. Bayesian System Identification 

The goal of Bayesian system identification is to obtain 
models of dynamical systems that quantify the degree of un- 
certainty in the systems' dynamics [1]. Modeling uncertainty 
can be particularly useful in applications such as robotics or 
adaptive control where decisions about future control actions 
will be affected by how confident the agent is about the 
system dynamics. 

Bayesian system identification relies on the use of 
Bayesian statistics and hence interprets probabilities as de- 
grees of belief. Such an approach to probabilistic modeling 
has been very successful in the fields of machine learning 
and artificial intelligence [5]. We will borrow heavily from 
advances in those fields to model uncertainty in a principled 
manner. In particular, we will solve the nonlinear regression 
task from a NARX model with Bayesian nonlinear regression 
implemented with Gaussian processes. 

B. Gaussian Process Regression 

Gaussian processes are stochastic processes that have 
proven very successful at the task of nonlinear regression. 
Their unbounded flexibility makes them ideal whenever it is 
hard to specify a parametric form for an unknown nonlinear 
function. 

Formally, a Gaussian process is a collection of random 
variables, any finite number of which have a joint Gaussian 
distribution [6]. We say that a random function f(x) e M, 
with x € R", follows a Gaussian process and write it as 



f~GV(m(x),k(x,x')) 



(6) 



when all values of / at any locations x are jointly normally 
distributed. m{x) and k(x,x') represent the mean function 
and the covariance function respectively. Together, they com- 
pletely specify the Gaussian process and can be parametrized 




Fig. 1. Gaussian process regression. Black dots represent the data. The 
red line is the mean of the Gaussian process prediction and the area shaded 
in red covers the 95% confidence region. The three black lines are random 
functions drawn from the posterior that could have generated the data. 



by a, typically small, set of hyper-parameters. Those hyper- 
parameters define the kind of functions that one expects to 
see. For instance, they control the degree of smoothness or 
periodicity but do not constrain the function to have any 
predefined shape. 

In Gaussian process regression, a likelihood function 
p(y\f) relates the vector of real-valued observations y = 
(yi, j/at) with the latent function /. In the particular case 
where observations are contaminated with additive Gaussian 
i.i.d. noise 



p{Vi\fi) = N{y % \f ll <j 2 ) 1 



(7) 



there exists a convenient closed-form solution for the poste- 
rior distribution over functions, i.e. the distribution over the 
unknown function f(x) after incorporating all the observed 
data. The posterior distribution of the function at any new 
location x* is described by 

p(f*\x*,y,X)=N'(f*,cov(f*)) (8) 

where X — {xi, ...,xn} is the set of regressor vectors. In 
the common case where m(x) = we obtain 

/, =K(X«,X)[K(X,X)+a 2 I]- 1 y, (9) 
cov(/„) =K(X m ,X m )- 

K(X., 1 X)[K(X,X) + (T 2 I}- 1 K(X,X.«). (10) 

where the covariance matrices K are constructed by applying 
the covariance function to all elements in each of the sets 
that they take as arguments: 

K(A,B) ij = k(a i ,b j ). (11) 

In Figure [T] we show an example of Gaussian process 
regression. Given a set of seven noisy data points, the goal 
is to infer the latent function that generated them. We do 
not assume any parametric form for this latent function. Our 
assumption is that the latent function was a draw from a 
Gaussian process with zero mean and a squared exponential 
covariance function [6]. To obtain the posterior distribution 



over the latent function we use equations (|9]l and ( 10 1. Note 
how the confidence region widens in areas that are far from 
the data points. 



C. Model Selection for Gaussian Processes 

In the Gaussian process framework, model selection refers 
to using data to select the functional form for the mean and 
the covariance functions as well as selecting the values for 
their hyper-parameters. In addition, the likelihood function 
may also have some hyper-parameters, such as the noise 
variance a 2 in equation For convenience we include the 
likelihood hyper-parameters into the vector 9. 

For Gaussian process models, it is very attractive, from a 
computational and theoretical point of view, to use Bayesian 
model selection procedures. In particular, maximization of 
the marginal likelihood of the Gaussian process model has 
proven to be very effective [6]. The objective is to find 



#Opt = argmax p(y\X,6) (12) 



p(y\X,6)= P (y\f,0) p(f\X,9) df (13) 



where 



is the marginal likelihood. The first term inside the integral 
is the likelihood function and the second term is a normal 
distribution that can be obtained directly from the mean and 
covariance functions that specify the Gaussian process. For 
the particular case of Gaussian processes with zero mean and 
a Gaussian likelihood, such as in eq. Q, we obtain: 

p(y\X,6) = X r {0 1 K e (X 1 X) + a 2 I). (14) 

To avoid underflow errors, it is common to work with 
the logarithm of the marginal likelihood rather than the 
marginal likelihood itself. Since the logarithm function is 
strictly monotonically increasing, this transformation does 
not change the location of the optimum. If we take logarithms 
on both sides of the equation we obtain 



1 



T ry-1 



\ogp{y\X,6) = --y s K 



2 / --log|if|--log2 7 r (15) 



where K = K S (X, X) + a 2 1. 

For optimization, it will be useful to have the derivative of 
the marginal likelihood with respect to any of the parameters 
of interest: 

d 



88 



logp(y\X(u),6) 



y T K 



tr(K 



de n ' 



K~ 



y 



(16) 



The marginal likelihood embodies what is usually de- 
scribed as Bayesian Occam's razor: an automatic balancing 
of model fit and model complexity [6]. In contrast with 
maximum likelihood estimation, the marginal likelihood is 
found by integrating over the latent values of the function 
f. 

The computational complexity of the marginal likelihood 
in eq. (14i is 0(N 3 ) due to the inverted covariance matrix 
in the density of a multivariate Gaussian distribution. This 
cubic complexity effectively limits the approach to values 
of N not greater than a few thousands. However, there is a 
mitigating factor: once K~ x has been computed for eq. (15 1, 



each derivative of the marginal likelihood with respect to the 
hyper-parameters has a computational complexity of 0(N 2 ). 
This is a particular feature of the marginal likelihood that is 
useful for efficient gradient based optimization. 

D. Sparse Gaussian Process Regression 

Several approaches have been proposed to accelerate 
Gaussian processes for very large data sets. Those ap- 
proaches aim at reducing the particularly disadvantageous 
0(N 3 ) cost to train the hyper-parameters but also target a 
reduction of the O(N) cost necessary to evaluate the mean 
of a posterior GP (eq. [9]) or the 0(N 2 ) cost of computing 
its posterior covariance (eq. 



10) 



A common strategy is to introduce a new set of M latent 
variables / at input locations X where M < N [7]. The 
intuition behind these methods is that the new set of latent 
variables incorporates much of the information present in the 
original, large, dataset. 

A particularly effective approximation method named 
FITC, or originally SPGP, was proposed by Snelson and 
Ghahramani [8]. Due to space constraints we avoid a com- 
plete re-derivation of the FITC method and simply state 
its predictive distribution given a dataset {y, X} and the 
locations of the latent inputs X: 



p(f*\x*,y,X,X) = Af(p*,a 2 ), 



where 



p* = kJQ A jK M n(A + 
o* = K„-Kj(Ku 1 - 



o 2 lT l y 
)K* 



and 



Qm = K M + K MN {K + a 2 1) 1 K NM 
A = diag(A), \ n = K nn - K^K M K n , 



(17) 

(18) 
(19) 

(20) 
(21) 



where the matrices K are constructed using the covariance 
function as shown in eq. (jTTJ and we follow the following 
convention: Kab = K{Xa,Xb) and Ka = K{Xa,Xa)- 



III. GP-FNARX Model 



A. Overview 



In this section, we describe the GP-FNARX model for 
nonlinear system identification based on a nonlinear autore- 
gressive exogenous (NARX) model with filtered regressors 
(F) where the nonlinear regression problem is tackled us- 
ing Gaussian processes (GP). The approach integrates pre- 
filtering with system identification into a fully automated 
procedure that yields a Bayesian model. 

Autoregressive models with exogenous inputs attempt to 
predict future outputs by considering them a function of past 
inputs and outputs to the system: 



yt = f(yt-i,yt-2,:;U t -i,...). 



(22) 



The identification problem is then posed as a regression task. 
In other words, we want to infer the function y = f(x) 
from a finite number of examples {yi,Xi}, where xi = 



•)• 



If the input and output signals to the dynamical system 
are noisy we face an errors-in-variables problem since the 
regressors x are noisy. Noise in the regressors makes the 
regression problem particularly hard. This is one of the 
reasons why input signals are normally pre-processed before 
trying to identify a model of the system dynamics. For 
instance, one can carefully low-pass filter the signals to 
remove high-frequency noise irrelevant to the identification 
task at hand. Since we are looking for a method that avoids 
having a human in the loop, we take a data-based approach 
which consists of parameterizing the data pre-processing 
stage and optimizing its parameters jointly with the hyper- 
parameters of the Gaussian process used for regression. 

We will consider any data pre-processing function applied 
to the input and output signals 



(y,iL) = h(y,u,u) 



(23) 



where the pre-processed signals vary smoothly with respect 
to the vector of pre-processing parameters u). This smooth- 
ness condition is imposed in order to obtain a probabilistic 
model with a differentiable marginal likelihood. 

We can rephrase the autoregressive model in terms of the 
pre-processed regressors: 



Vt = f(yt-i,yt-2,--,u t -i,...). 



(24) 



Note that the left hand side term is not pre-processed. This 
will be key when optimizing the marginal likelihood of the 
model. 

B. Optimization of the Marginal Likelihood 

In section II-C we have described how the marginal 



likelihood is a very powerful way to perform model selection 
due to its ability to automatically trade off model fit and 
model complexity in a principled manner. Our goal here 
will be to maximize the marginal likelihood of the Gaussian 
process regression model with respect to the signal pre- 
pocessing parameters and also, simultaneously, with respect 
to the hyper-parameters of the GP. For convenience, we 
introduce ip = (ui, 9) grouping the two kinds of parameters. 

Rather than directly maximizing the marginal likelihood 
we will employ hill-climbing optimization to maximize 
the log marginal likelihood. For notational simplicity, we 
group all the pre-processed regressors into a matrix X — 
X(y,u,uj) so that the marginal likelihood becomes 



logp(y\X(cu),0). 



(25) 



The derivatives with respect to the GP hyper-parameters 

d 



88, 



\ogp{y\X{u),6) 



(26) 



are straightforward since we typically choose differentiable 
covariance functions. However, the derivatives with respect 
to any of the pre-processing parameters 

d 



du} k 



logp(y\X(w),0) 



(27) 



Inputs: X = {output signals y — (yi, ...,y^), input signals 
u = (ui, itjv) , model order vector 77} 

1. Vo <- InitialGuess(I) 

2. V'opt «- MaximizeMargLikelihood(V'o,I) 

3. model <- CoMPUTEFITCPREDiCTOR(V>opt,2T) 

Fig. 2. High-level pseudo-code for the GP-FNARX algorithm. 



can be more difficult to compute. Using eq. ( [To} it is clear 



that we need to find 



dK 



We can write the derivative of a 



single element of the covariance matrix as 



dKjj 



dk(xi,Xj) dk d£i dk dxj 



duj k 



(28) 



where the derivatives with respect to the regressors are 
straightforward to compute when using smooth covariance 
functions. However, the derivatives of the regressors with 
respect to the pre-processing parameters may be hard to com- 
pute. In any case, if the pre-processing function is smooth, 
the derivatives J^- can be approximated numerically by 
finite differences at the cost of one extra evaluation of the 
pre-processing function per dimension of ui. 

C. Sparse GPs for Computational Speed 

Computing the marginal likelihood for datasets with more 
than a few thousand points becomes very expensive compu- 
tationally. For larger datasets we adopt a pragmatic strategy 
whereby the marginal likelihood is maximized by employing 
only a subset of the data. 

Once the GP hyper-parameters and the data pre-processing 
parameters have been set, we use the FITC approximation 
for Gaussian processes to obtain a model of the system 
dynamics that uses the whole dataset to make predictions. 
Equation ( fTTj ) shows the formula to make predictions. After 
all possible pre-computation, the computational complexity 
is 0(M) for the mean of the model and 0(M 2 ) for its 
variance. The computational efficiency comes from having 
condensed the original dataset with N points into a smaller 
set of M inducing points. 

D. Summary of the Proposed Methodology 

In Fig. [2] we present a high-level overview of the GP- 
FNARX. The first step consists in computing an initial guess 
for the unknown parameters. We have found that a successful 
strategy is to run a few steps of optimization of the marginal 
likelihood with respect to the GP hyper-parameters followed 
by a few steps with respect to the pre-processing parameters. 
By running these two steps on a subset of the data we can 
very quickly obtain the right order of magnitude for the 
unknown parameters. 

The second step consists in a straightforward joint opti- 
mization of the GP and pre-processing parameters. This step 
can be performed by either using a subset of the data with a 
full GP or on the full dataset with the FITC approximation. 
The FITC approach is theoretically more better but we 
have obtained very good and fast results by optimizing the 
marginal likelihood computed on a subset of the data. 



Silverbox benchmark 



Wiener-Hammerstein benchmark 



GP-FNARX (SoD), 23+2 
GP-FNARX (FITC), 23+2 
GP-NARX (SoD), 14+1 
GP-NARX (FITC), 14±1 
wavenet nlarx, 5±1 
sigmoidnet nlarx, 83+9 
treepartition nlarx, 7+0 
wavenet nlarx (filt), 6+2 
sigmoidnet nlarx (filt), 74+11 
treepartition nlarx (filt), 7+0 




* 



GP-FNARX (SoD), 25+2 
GP-FNARX (FITC), 25+2 
GP-NARX (SoD), 16+1 
GP-NARX (FITC), 16+1 
wavenet nlarx, 7±3 
sigmoidnet nlarx, 85+12 
treepartition nlarx, 8+0 
wavenet nlarx (filt). 5+1 
sigmoidnet nlarx (tilt). 85+8 
treepartition nlarx (filt), 8+0 



SNR [dB] 



SNR [dB] 



Fig. 3. Root mean squared prediction error on the test set as a function of the signal to noise ratio in the output signal. The mean computation time and 
standard deviation for each method are displayed in the legend (in seconds). Experiments are repeated 10 times, the marker is positioned at the median 
value and error-bars indicate the 10-90 percentile interval. 



In the third and final step we pre-compute most of the 
FITC predictor equations ( fl"8] l and ( [T9| in order to obtain a 
model that can make mean predictions in 0(M) and variance 
predictions in 0(M 2 ). 

As an aside, we want to highlight that the choice of 
orders of the autoregressive model is not critical to the 
performance of the algorithm if two conditions are met: 

i) the order chosen is higher than the optimal order and 

ii) the automatic relevance determination (ARD) covariance 
function is chosen for the Gaussian process. The Bayesian 
Occam's razor embodied by the marginal likelihood is able to 
automatically disable regressors that are irrelevant to the task 
at hand. In our experiments we verified that adding hundreds 
of regressors on a problem of low order did not cause any 
overfitting and only represented a computation time penalty. 

IV. Related Work 

Kocijan et al. [9] presented a NARX model where the 
nonlinearity is captured using Gaussian process regression. 
More recently Gutjahr et al. [10] have proposed the use of 
FITC in order to make computation more efficient in this 
kind of models. Our model follows this approach but also 
considers that the input/output signals may be contaminated 
by noise which presents an errors-in-variables regression 
problem. 

We mitigate the errors-in-variables problem by incorporat- 
ing into the system identification the signal pre-processing 
step that is usually carried out manually [4], [11]. For 
instance, a portion of the noise can be removed by low-pass 
filtering the signals in the time domain. 

More sophisticated approaches combining smoothing and 
system identification for state space models have the po- 
tential to give better results if one is ready to accept their 
additional complexity [12], [13]. In contrast, our method 
needs to be seen as a pragmatic approach that attempts to 
offer a better solution than standard NARX models with 
a simple procedure and a limited amount of computational 
effort. 



V. Experimental Evaluation 

In this section we present an experimental evaluation of the 
proposed system identification method. We have used data 
from two nonlinear system identification benchmarks based 
on electric circuits: i) the Silverbox benchmark originating 
from the 2004 IFAC Symposium on Nonlinear Control Sys- 
tems (NOLCOS) and ii) the SYSID09 Wiener-Hammerstein 
system identification benchmark [14] which has been the 
object of a special section in the November 2012 issue of 
the "Control Engineering Practice" journal [15]. 

Both datasets are corrupted by a very small amount of 
noise. For instance, the authors of the Wiener-Hammerstein 
dataset estimate its signal to noise ratio to be of 70 dB. Since 
we are attempting to demonstrate the ability of our method 
to cope with measurement noise, we will inject different 
amounts of synthetic i.i.d. Gaussian noise to the output 
signals to make the identification task more challenging. 
Being able to have original signals with little noise will be 
convenient to test the quality of the resulting models. 

Very good algorithms have been tailored to the specific 
benchmarks at hand. However, since our goal is to provide 
an automatic method for system identification, we have 
considered that the best comparison would be against other 
off-the-shelf alternatives. We have compared our model with 
respect to several NARX models available in the Matlab 
System Identification toolbox [11]. 

Following this spirit, we have avoided any tuning of 
our method to the particular benchmarks presented in this 
section. For instance, by using knowledge about the under- 
lying system of the Silverbox benchmark, we obtained better 
performance by adding cubic regressors into our NARX 
model. However, we have not used those custom regressors 
when reporting the performance of our model. 

Regarding the pre-processing step, we have chosen a 
simple zero-phase second order Butterworth low-pass filter. 
In this case, the filter parameter to represents the cut-off 
frequency of the filter. 



In Figure [3] we have plotted the prediction errors on both 
benchmarks for a number of different models and signal-to- 
(synthetic)-noise ratios. We have tested the following models: 

• GP-FNARX (SoD): the model presented in this paper using 
a subset of 512 points chosen randomly (subset of data 
approximation). 

• GP-FNARX (FITC): the model presented in this paper 
using a FITC approximation with M = 512 inducing 
points chosen randomly. 

. GP-NARX (SoD): same as GP-FNARX (SoD) but with no 

pre-filtering of the signals. 
. GP-NARX (FITC): same as GP-FNARX (FITC) but with no 

pre-filtering of the signals. 

• * nlarx: 3 different NARX models implemented in 
the Matlab System Identification toolbox [11]. Default 
options. No pre-filtering of the signals. 

• * nlarx (filt): same as * nlarx but using the pre-filtering 
parameters taken from GP-FNARX (FITC) (the computa- 
tion time for computing those parameters is not taken 
included in the reported figure). 

All models have order 10 for the inputs and the outputs. 

We observe that the GP-FNARX method with FITC ap- 
proximation provides the lowest error for all noise levels in 
both benchmarks. An interesting pattern that is apparent in 
Figure |3]consists in the FITC methods performing better than 
their SoD peers for the Silverbox benchmark whereas, for the 
Wiener-Hammerstein benchmark, the models with signal pre- 
processing (GP-FNARX) perform better than their unfiltered 
equivalents. 

Overall, these results allow us to be optimistic with regards 
to the prediction capabilities of GP-FNARX models and 
validate the use of the marginal likelihood as a criterion for 
model selection in the context of automated pre-processing 
of the data. 

The legend of Figure [5] reports the computation times 
of the experiments when performed on a machine with a 
2008 Intel Core i7-920 processor at 2.67 GHz. Although 
the training time of the GP-FNARX model is higher than 
the wavenet and treepartition models, it is significantly faster 
than sigmoidnet yet it provides a lower prediction error. 

GP models have a one degree of freedom knob which 
can be used to trade off accuracy with speed: the number 
of points used for the SoD or FITC methods. Increasing the 
number of points is not equivalent to increasing the number 
of parameters in a parametric model and does not imply any 
risk of overfitting. 

VI. Conclusions 

We have presented GP-FNARX, an automated approach 
for the identification of nonlinear systems based on autore- 
gressive models implemented with Gaussian processes. Our 
approach combines the data pre-processing step with the 
identification task per se. By maximizing the marginal likeli- 
hood of the probabilistic model we obtain a fitting procedure 
that naturally balances model fit and model complexity. 

The Gaussian process model resulting from the identifica- 
tion belongs to the family of Bayesian nonparametric mod- 



els. As such, it does not rely on a rigid parametric functional 
form and has the capability to adapt to arbitrarily complex 
data. Moreover, the models can report the uncertainty present 
in their own predictions. For instance, if the model is used in 
an operating region which is far from the training data, the 
model will report higher uncertainty. This feature is useful 
in applications such as robotics or adaptive control where 
different control strategies may be appropriate depending on 
the confidence in the predictions. 

One of the reasons why Bayesian system identification is 
not widely used is the lack of methods with a competitive 
computational cost. We believe that our method presents a 
realistic alternative to parametric nonlinear function approxi- 
mators: we obtain good performance with a comparable cost 
and we add the ability to quantify the uncertainty in the 
model. In addition, our method is simple to apply due to the 
automation of the pre-processing stage. 

We have made available code implementing GP-FNARX 
in our website in an attempt to make Bayesian system 
identification more widespread. 
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